Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

instrumentation changes #55888

Merged
merged 4 commits into from
Jul 22, 2021
Merged

instrumentation changes #55888

merged 4 commits into from
Jul 22, 2021

Conversation

Maoni0
Copy link
Member

@Maoni0 Maoni0 commented Jul 18, 2021

We've never had ETW events that had fields about time - we rely on the the timestamps of the ETW events themselves to calculate time. This checkin introduces some new events/event fields that will include time info instead of firing individual events that otherwise don't carry much info, ie, they'd just be there so we could use their timestamps, which would be a nuisance when we have many heaps.

The rationale behind firing events with time already calculated is

  1. it does reduce overhead since we don't fire as many events so we can fire these in fewer events for informational level
  2. firing individual vents and having the tools interpret them isn't very useful unlike events such as GCStart/GCEnd which can be correlated with other events (eg, you get GCStart, and then get a bunch of other events so you know those happened after a GC started) wheras things like very GC internal things don't have this property, ie, we are not gonna care that "these other events happened during a GC and specifically during the relocation phase".

Added MarkWithType events for marking due to dependent handles, newly promoted due to dead finalizable objects and mark_steal. Perfview needs to be updated to work with this otherwise you can't see the GCStats view (I'm submitting a PR for that).

Recorded time for marking roots (but sizedref is separate), short weak, ScanForFinalization, long weak, relocate, compact and sweep.

Added a new version that includes the size of the object that triggered the event. This is for a request from this issue. I'm unclear if the config for the threshold was requested as a runtime config or something configurable via profiling API so asked for clarification there.

Provided a new rundown GCSettings event that has info on settings hard to get from traces.

Added a GCLOHCompact event which is fired for all heaps (heaps that didn't actually do LOH compact would have values of all 0s). I'm trying to add events that don't require a lot of correlation with other events to interpret. This is to help get an idea how long it takes to compact LOH and how reference rich it is.

Added a verbose level GCFitBucketInfo event which helps us with FL fitting investigation. I'm firing this for 2 things in a gen1 GC:

  1. for plugs that allocated with allocate_in_condemned_generations the event will capture all of them with the same bucketing as we do for gen2 FL;

  2. for gen2 FL we look at the largest free items that take up 25% of the FL space, or if there are too many of them we stop after walking a certain number of free items as we have to limit the amount of time we are spending here.


Fixed issues -

For BGC we were reporting the pinned object count the same as the last FGC..and that causes confusion so fixed that.

Fixed #45375

While fixing #45375, I noticed we have another bug related to alloc tick which is we are not firing the alloc tick events correctly for LOH and POH since the ETW alloc tracking didn't seperate them... fixed this too.

Added the POH type for GCSegmentTypeMap which was missing in the manifest.


Did some cleanup in eventtrace.h - we don't need the info that's not used which means we just ended up duplicating things like _GC_ROOT_KIND in more places than needed.


Note, I realize that I do have some inconsistency with the FEAETURE_EVENT_TRACE here, as in, some code should be under an #ifdef check but is not. I will look into a remedy for that with a separate PR.

@ghost
Copy link

ghost commented Jul 18, 2021

Tagging subscribers to this area: @dotnet/gc
See info in area-owners.md if you want to be subscribed.

Issue Details

We've never had ETW events that had fields about time - we rely on the the timestamps of the ETW events themselves to calculate time. This checkin introduces some new events/event fields that will include time info instead of firing individual events that otherwise don't carry much info, ie, they'd just be there so we could use their timestamps, which would be a nuisance when we have many heaps.

The rationale behind firing events with time already calculated is

  1. it does reduce overhead since we don't fire as many events so we can fire these in fewer events for informational level
  2. firing individual vents and having the tools interpret them isn't very useful unlike events such as GCStart/GCEnd which can be correlated with other events (eg, you get GCStart, and then get a bunch of other events so you know those happened after a GC started) wheras things like very GC internal things don't have this property, ie, we are not gonna care that "these other events happened during a GC and specifically during the relocaton phase".

Added MarkWithType events for marking due to dependent handles, newly promoted due to dead finalizable objects and mark_steal. Perfview needs to be updated to work with this otherwise you can't see the GCStats view (I'm submitting a PR for that).

Recorded time for marking roots (but sizedref is separate), short weak, ScanForFinalization, long weak, relocate, compact and sweep.

Made the 100k alloc tick event threshold configurable (set with System.GC.AllocTick in runtimeconfig.json or the COMPlus_AllocTickThreshold env var) and added a new version that includes the size of the object that triggered the event. This is for a request from this issue.

Provided a new rundown GCSettings event that has info on settings hard to get from traces.

Added a GCLOHCompact event which is fired for all heaps (heaps that didn't actually do LOH compact would have values of all 0s). I'm trying to add events that don't require a lot of correlation with other events to interpret. This is to help get an idea how long it takes to compact LOH and how reference rich it is.

Added a verbose level GCFitBucketInfo event which helps us with FL fitting investigation. I'm firing this for 2 things in a gen1 GC:

  1. for plugs that allocated with allocate_in_condemned_generations the event will capture all of them with the same bucketing as we do for gen2 FL;

  2. for gen2 FL we look at the largest free items that take up 25% of the FL space, or if there are too many of them we stop after walking a certain number of free items as we have to limit the amount of time we are spending here.


Fixed issues -

For BGC we were reporting the pinned object count the same as the last FGC..and that causes confusion so fixed that.

Fixed #45375

While fixing #45375, I noticed we have another bug related to alloc tick which is we are not firing the alloc tick events correctly for LOH and POH since the ETW alloc tracking didn't seperate them... fixed this too.

Added the POH type for GCSegmentTypeMap which was missing in the manifest.


Did some cleanup in eventtrace.h - we don't need the info that's not used which means we just ended up duplicating things like _GC_ROOT_KIND in more places than needed.


Note, I realize that I do have some inconsistency with the FEAETURE_EVENT_TRACE here, as in, some code should be under an #ifdef check but is not. I will look into a remedy for that with a separate PR.

Author: Maoni0
Assignees: -
Labels:

area-GC-coreclr

Milestone: -

@Maoni0
Copy link
Member Author

Maoni0 commented Jul 18, 2021

@dotnet/gc please review.

@PeterSolMS please let me know if you are ok with how I structured the new GCFitBucketInfo event based on what we talked about.

@noahfalk PTAL at the changes in clretwall.man and the related interface changes.

@Maoni0 Maoni0 force-pushed the instru branch 2 times, most recently from a85f903 to eeccd35 Compare July 19, 2021 00:16
Copy link
Member

@noahfalk noahfalk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly looked fine to me. A few thoughts overall:

  1. I wasn't sure if the config mechanism for AllocTick events will be readily usable by profiler vendors. They might prefer a profiler API.
  2. Emitting the rundown settings at startup too is an option if it is useful now or in the future
  3. Given that these events are mostly intended for GC team's own analysis I didn't push for docs.

src/coreclr/vm/ClrEtwAll.man Outdated Show resolved Hide resolved
src/coreclr/vm/ClrEtwAll.man Outdated Show resolved Hide resolved
src/coreclr/vm/ClrEtwAll.man Outdated Show resolved Hide resolved
src/coreclr/vm/ClrEtwAll.man Outdated Show resolved Hide resolved
src/coreclr/vm/ClrEtwAll.man Show resolved Hide resolved
@noahfalk noahfalk mentioned this pull request Jul 22, 2021
10 tasks
@Maoni0 Maoni0 merged commit 921bca0 into dotnet:main Jul 22, 2021
@ghost ghost locked as resolved and limited conversation to collaborators Aug 21, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

GCAllocationTick_V2 ETW events are reporting wrong allocation sizes in multithreaded application
5 participants